Abstract

Semi-supervised clustering is an important topic in machine learning and computer vision. The
key challenge is to learn a metric under which instances sharing the same label are more likely to be
close to each other in the embedded space. However, little attention has been paid to learning
better representations when the data lie on a non-linear manifold. Fortunately, deep learning has recently
led to great success in feature learning. Inspired by these advances, we propose a deep
transductive semi-supervised maximum margin clustering approach. More specifically, given pairwise
constraints, we exploit both labeled and unlabeled data to learn a non-linear mapping under a maximum
margin framework for clustering analysis. Thus, our model unifies transductive learning, feature learning
and maximum margin techniques in a semi-supervised clustering framework. We pretrain the deep
network with restricted Boltzmann machines (RBMs) greedily, layer by layer, and optimize our
objective function with gradient descent. By checking the most violated constraints, our approach updates
the model parameters through error backpropagation, in which deep features are learned automatically.
The experimental results show that our model is significantly better than the state of the art on
semi-supervised clustering.


1 Introduction 


In this paper, we investigate semi-supervised clustering with side information in the form of pairwise
constraints. In general, a pairwise constraint between two examples indicates whether they belong to the same
cluster or not, which provides the supervision information: a same-label (or must-link) constraint denotes
that the pair of instances should be partitioned into the same cluster, while a different-label (or cannot-link)
constraint specifies that the pair of instances should be assigned to different clusters [?].

Semi-supervised learning with pairwise constraints has received considerable attention recently, especially for
classification and clustering [?]. On the one hand, it is relatively easy for a human in the loop to decide
whether two items are similar or not, because it often involves little effort from users. On
the other hand, maximum margin techniques have shown promising performance on classification tasks,
and thus they have been widely used in semi-supervised clustering [?]. In general, traditional
semi-supervised clustering approaches either learn a distance metric based on the pairwise constraints, or
leverage discriminative methods, such as k-nearest neighbor (kNN) and support vector machines (SVM), for
better clustering performance. However, collapsing examples that belong to the same cluster approximately
into a single point cannot always be achieved with simple linear transformations, especially when the data
lie on a non-linear manifold. Although kernel methods are widely used for non-linear cases, they are shallow
approaches and need hyperparameters to be specified in most situations [?]. Fortunately, recent advances in
the training of deep networks provide a way to learn non-linear transformations of data, which are useful for
supervised/unsupervised tasks [5, ?].

Inspired by feature learning [?], we propose a deep transductive semi-supervised clustering approach,
which inherits the advantages of both deep learning and maximum margin methods. Our method learns
features automatically from observations, which amounts to learning a metric as in [31]. However, unlike a linear
mapping, e.g. a Mahalanobis metric [?], our method can learn a non-linear manifold representation, which
is helpful for clustering and classification [2]. With the learned features as the input to the semi-supervised
maximum margin clustering framework, we can learn the clustering weights. To leverage the unlabeled data,
we also incorporate transductive learning to improve the clustering analysis. Through backpropagation,
our approach can learn discriminative features via maximum margin techniques. Hence, our model unifies
maximum margin, semi-supervised information and deep learning in a joint framework. We first pre-train our
model with stacked RBMs for feature representations. Then we compute the gradients w.r.t. the
parameters and optimize our objective function in an alternating manner: data representation, followed by
model weight optimization with gradient descent. We test our model on several data sets and show that it
yields accuracy significantly better than the state of the art.

The outline of this paper is as follows. In Section 2, we review the related work. Then, we present the model
in Section 3. Section 4 presents the results of our experiments on a few widely used data
sets. Finally, we conclude the paper.


2 Related work 


Semi-supervised clustering with partial labels generally explores two directions to improve performance:
(1) leveraging more sophisticated classification models, such as maximum margin techniques [?]; (2) learning
a better distance metric [23, 30].

Maximum margin clustering (MMC) aims to find the hyperplanes that can partition the data into
different clusters over all possible labelings with large margins [33, ?, 35]. Nevertheless, the accuracy of the
clustering results of MMC may sometimes be poor because of its unsupervised nature [?].
Thus, it is of interest to incorporate semi-supervised information, e.g. pairwise constraints, into the
recently proposed maximum margin clustering framework. Recent research demonstrates the advantages of
leveraging pairwise constraints on semi-supervised clustering problems [?]. In particular,
COPKmeans [11] is a semi-supervised variant of Kmeans that follows the same clustering procedure as
Kmeans while avoiding violations of pairwise constraints. MPCKmeans [3] extends Kmeans and utilizes
both metric learning and pairwise constraints in the clustering process. More recently, [20] showed that
classification with pairwise constraints can be improved under a maximum margin framework. [34] applied the
margin-based approach to semi-supervised clustering problems and obtained competitive results.

Learning a good metric over the input space is critical for a successful semi-supervised clustering approach.
Hence, another direction for clustering is to learn a distance metric [?] which can reflect the
underlying relationships between the input instance pairs. The pseudo-metric [23], parameterized by positive
semi-definite (PSD) matrices, is learned with an online updating rule that alternates between projections onto
the set of PSD matrices and onto half-space constraints imposed by the instance pairs. [32] proposed to learn a
(Mahalanobis) distance metric that respects pairwise constraints for clustering. In [6], an information-theoretic
approach to learning a Mahalanobis distance function via the LogDet divergence is proposed. Recently, a supervised
approach to learning a Mahalanobis metric was also proposed in [?], which minimizes the pairwise distances between
instances in the same cluster while increasing the separation between data points from dissimilar classes. To handle
data that lie on non-linear manifolds, kernel methods are widely used. Unfortunately, these non-linear
embedding algorithms are shallow methods.

On the other hand, recent advances in deep learning [?] have sparked great interest in dimension
reduction [?] and classification problems [?]. In a sense, the success of deep learning lies in the
learned features, which are useful for supervised/unsupervised tasks [?]. For example, the binary hidden
units in discriminative restricted Boltzmann machines (RBMs) [?, 5] can model latent features of
the data that improve classification. Deep learning for semi-supervised embedding [?] extends shallow
semi-supervised learning techniques such as kernel methods with deep neural networks, and yields promising
results. The work of [24] is most closely related to our proposed algorithm. It presented deep learning with support
vector machines, which can learn features automatically with labeled data under a discriminative learning
framework. However, their approach is fully supervised and designed for classification problems, while our model
targets semi-supervised clustering problems. Compared to conventional methods, our model considers both feature
learning and transductive principles, so that it can handle complex
data distributions and learn a better non-linear mapping to improve clustering performance.


3 Deep Transductive Semi-supervised Maximum Margin Clustering


In this section, we introduce transductive semi-supervised maximum margin clustering, with deep
features learned simultaneously in a unified framework.


3.1 Overview of our approach 

Let $X = \{x_i\}_{i=1}^N$ ($x_i \in \mathbb{R}^D$) be a set of $N$ examples, which belong to $K$ clusters
with label set $Z$. In addition to the unlabeled data, there is additional partially labeled data in the form of
pairwise constraints $C = \{(x_i, x_j, \delta(z_i = z_j))\}$, a kind of side information that indicates whether the
two instances $(x_i, x_j)$ are from the same cluster or not (via the $\delta$ function). Most methods attempt to
learn weights $w_k \in \mathbb{R}^D$ for each cluster $k \in [1, K]$, so that these constraints are satisfied as
much as possible.

Instead of learning a linear mapping or a Mahalanobis metric [23, 30], we are interested in a non-linear mapping
function. To make this concrete, suppose we have learned a non-linear mapping function $f : \mathbb{R}^D \to \mathbb{R}^d$.
Then, for each instance $x \in X$, we can get its embedding code $h = f(x)$ (note that the pairwise constraints
are also kept in the coding space). Given the learned features $h$, we leverage semi-supervised maximum
margin clustering to partition the data. Just as in multi-class classification problems [25], we use the joint
feature representation $\Phi(h, z)$ for each $(h, z) \in X \times Z$


\[
\Phi(h, z) = \begin{bmatrix} h \cdot \delta(z = 1) \\ \vdots \\ h \cdot \delta(z = K) \end{bmatrix} \tag{1}
\]


where $\delta$ is the indicator function (1 if the equation holds, otherwise 0). Correspondingly, the hyperplanes
for the $K$ clusters can be parameterized by the weight vector $W \in \mathbb{R}^{(K \times d) \times 1}$, which is
the concatenation of the weights $w_k$, for $k = \{1, \ldots, K\}$. In other words,
$W[(k-1) \times d + 1 : k \times d] = w_k$. The clustering of testing
examples is done in the same manner as in the multiclass SVM [25].


\[
\max_{z \in [1, K]} W^T \Phi(f(x), z) \tag{2}
\]


For inference, we first project the data into the hidden space with the function $f$ and then do the clustering
analysis. The remaining problem is how to learn the weight parameter $W$ and the projection function $f$.
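
As a small illustration of Eqs. (1) and (2), the following Python sketch builds the joint feature vector and picks the best-scoring cluster. The function names (`joint_feature`, `score`, `predict`) and the 0-indexed cluster labels are assumptions made for this sketch, not part of the original method description.

```python
def joint_feature(h, z, num_clusters):
    """Eq. (1): stack K blocks of size d; block z holds h, the others are zero."""
    d = len(h)
    phi = [0.0] * (num_clusters * d)
    phi[z * d:(z + 1) * d] = list(h)  # clusters are 0-indexed in this sketch
    return phi


def score(W, h, z, num_clusters):
    """W^T Phi(h, z); this equals the dot product of w_z and h."""
    phi = joint_feature(h, z, num_clusters)
    return sum(w * p for w, p in zip(W, phi))


def predict(W, h, num_clusters):
    """Eq. (2): return the cluster label with the largest score."""
    return max(range(num_clusters), key=lambda z: score(W, h, z, num_clusters))


# Toy example with K = 2 clusters and d = 2 features.
W = [1.0, 0.0, 0.0, 1.0]  # concatenation of w_0 = (1, 0) and w_1 = (0, 1)
h = [0.2, 0.9]
label = predict(W, h, 2)
```

Because only one block of the joint feature is non-zero, the score reduces to a per-cluster dot product, which is why the concatenated $W$ and the individual $w_k$ can be used interchangeably.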
3.2 Objective function


We would like to extend semi-supervised clustering with deep feature learning. Deep learning consists of
learning a model with several layers of non-linear mappings. As mentioned before, $h \in \mathbb{R}^d$ is the code
produced by the function $f$, a non-linear mapping defined by an $L$-layer neural network, s.t.


\[
h_i = f(x_i) = \underbrace{f_L \circ f_{L-1} \circ \cdots \circ f_1}_{L \text{ times}}(x_i) \tag{3}
\]


where $\circ$ indicates function composition, and $f_l$ is the logistic function with weight parameter $\theta_l$
for each layer $l = \{1, \ldots, L\}$; refer to Sec. 3.3 for more details. With a slight abuse of notation,
for any input $x$, if we denote the output of the $l$-th layer as $f_{1 \to l}(x)$, then we get $h = f_{1 \to L}(x)$.
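
The composition in Eq. (3) can be sketched in Python as follows. This is a minimal illustration, assuming that each layer stores its weights as one row per hidden unit with the bias as the trailing entry (matching the $[x, 1]$ extension described in Sec. 3.3); the names `layer_forward` and `forward` are hypothetical.

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def layer_forward(theta, x):
    """One layer f_l: logistic(theta [x, 1]); the last entry of each row is the bias."""
    x_ext = list(x) + [1.0]
    return [logistic(sum(t * v for t, v in zip(row, x_ext))) for row in theta]


def forward(thetas, x):
    """Apply f_1, ..., f_L in turn and return the output of every layer."""
    outputs = []
    for theta in thetas:
        x = layer_forward(theta, x)
        outputs.append(x)
    return outputs


# Two layers (2 -> 2 -> 1) with all-zero weights: every unit outputs logistic(0) = 0.5.
thetas = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]]]
outputs = forward(thetas, [1.0, 2.0])
h = outputs[-1]  # the code h = f_{1->L}(x)
```

Keeping the intermediate outputs is convenient because backpropagation needs the activations of every layer, not only the final code.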


In a similar manner to [20, 34], we incorporate the pairwise constraint information into the margin-based
clustering framework. In addition, we leverage the unlabeled data to separate the clusters with large margins,
following transductive learning. Specifically, given the pairwise constraint set
$C = \{(x_i, x_j, \delta(z_i = z_j))\}$, we first project the dataset $X$ into the embedded space and then minimize
the following transductive semi-supervised objective function


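\begin{align}
\min_{W, \Theta} \quad & \frac{\lambda}{2} \|W\|^2 + \frac{1}{n^+} \sum_{i \in C^+} \eta_i + \frac{1}{n^-} \sum_{j \in C^-} \eta_j + \frac{\beta}{U} \sum_{i \in U} \xi_i \tag{4} \\
\text{s.t.} \quad & \forall s_{i1}, s_{i2} \in Z,\ s_{i1} \neq s_{i2},\ \text{if } (h_{i1}, h_{i2}, \delta(z_{i1}, z_{i2})) \in C^+: \notag \\
& \max_{z_{i1} = z_{i2}} W^T \Phi(h_{i1}, h_{i2}, z_{i1}, z_{i2}) - W^T \Phi(h_{i1}, h_{i2}, s_{i1}, s_{i2}) \geq 1 - \eta_i, \quad \eta_i \geq 0 \tag{5} \\
& \forall s_{j1}, s_{j2} \in Z,\ s_{j1} = s_{j2},\ \text{if } (h_{j1}, h_{j2}, \delta(z_{j1}, z_{j2})) \in C^-: \notag \\
& \max_{z_{j1} \neq z_{j2}} W^T \Phi(h_{j1}, h_{j2}, z_{j1}, z_{j2}) - W^T \Phi(h_{j1}, h_{j2}, s_{j1}, s_{j2}) \geq 1 - \eta_j, \quad \eta_j \geq 0 \tag{6} \\
& \forall i \in U,\ \forall s_i \neq z_i \in Z: \notag \\
& \max_{z_i} W^T \Phi(h_i, z_i) - W^T \Phi(h_i, s_i) \geq 1 - \xi_i \tag{7}
\end{align}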


where $W$ is the clustering weight over the learned feature space, $\Theta = \{\theta_l\}_{l=1}^L$ are the weights
for each layer in the deep architecture, and $h_i$ is the mapping code of $x_i$ via Eq. (3).
$C^+ = \{(h_i, h_j, \delta(z_i = z_j)) \mid z_i = z_j\}$ are the same-label pairs, with the total number of such
constraints $n^+ = |C^+|$, and $C^- = \{(h_i, h_j, \delta(z_i = z_j)) \mid z_i \neq z_j\}$ are the different-label
pairs, with $n^- = |C^-|$. $U$ is the number of unlabeled instances, i.e. those that do not belong to any pairwise
constraint; with a slight abuse of notation, $U$ also indexes these instances. For convenience, we define
$\Phi(h_i, h_j, z_i, z_j) = \Phi(h_i, z_i) + \Phi(h_j, z_j)$,
which means that the mapping of a pairwise constraint is the sum of the individual example-label mappings. The
multi-layer non-linear mapping function $f$ projects $x_i$ into $h_i$, for $i \in [1, N]$. Compared with a linear
mapping, the advantage of using a deep network to parametrize the function $f$ is that a multi-layer network is
better at learning the non-linear function that is presumably required to collapse classes in the latent space, in
particular when the data has a very complex non-linear structure.

Eqs. (5) and (6) specify the conditions that need to be satisfied: the score of the best assignment
that satisfies a constraint should exceed, by a large margin, the score of any assignment that violates it.
More specifically, for any pair $(h_i, h_j, 1) \in C^+$, the largest score for assigning
$(h_i, h_j)$ to the same cluster should be greater than that for assigning the pair to different clusters by at
least 1 (a soft margin can be applied here too). Analogously, for any dissimilar pair $(h_i, h_j, 0) \in C^-$, the
score for assigning them to the best two different clusters should be greater than that for placing them
in the same cluster.
Eq. (7) follows from the principles of transductive learning: the score of the best cluster label for an
unlabeled instance should exceed that of the runner-up among the remaining clusters by at least 1.

The constrained optimization problem in Eq. (4) is hard to solve because the inequalities in Eqs. (5) and
(6) range over all possible combinations of two clusters for each pairwise constraint.
Thus, we transform it into the following equivalent unconstrained objective, which is generally easier to
solve


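\begin{align}
\min_{W, \Theta} \quad \frac{\lambda}{2} \|W\|^2
& + \frac{1}{n^+} \sum_{(h_{i1}, h_{i2}, 1) \in C^+} \Big\{ 1 - \Big[ \max_{z_{i1} = z_{i2}} W^T \Phi(h_{i1}, h_{i2}, z_{i1}, z_{i2}) - \max_{s_{i1} \neq s_{i2}} W^T \Phi(h_{i1}, h_{i2}, s_{i1}, s_{i2}) \Big] \Big\}_+ \tag{8a} \\
& + \frac{1}{n^-} \sum_{(h_{j1}, h_{j2}, 0) \in C^-} \Big\{ 1 - \Big[ \max_{z_{j1} \neq z_{j2}} W^T \Phi(h_{j1}, h_{j2}, z_{j1}, z_{j2}) - \max_{s_{j1} = s_{j2}} W^T \Phi(h_{j1}, h_{j2}, s_{j1}, s_{j2}) \Big] \Big\}_+ \tag{8b} \\
& + \frac{\beta}{U} \sum_{i \in U} \Big\{ 1 - \Big[ \max_{z_i} W^T \Phi(h_i, z_i) - \max_{s_i \neq z_i} W^T \Phi(h_i, s_i) \Big] \Big\}_+ \tag{8c}
\end{align}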


where $\{x\}_+ = \max(x, 0)$ and $h_i$ is the projected code of $x_i$ using Eq. (3). Term (8a) specifies the
condition that needs to be satisfied by the same-label pairwise constraints, while term (8b) gives the conditions for
the different-label pairs. The last term (8c) corresponds to the transductive constraints in Eq. (4).
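
To make the hinge form concrete, here is a minimal Python sketch of the transductive term (8c) only. It assumes the clustering weights are given as one row per cluster (`w_rows`, a hypothetical name equivalent to slicing the concatenated $W$), and that there are at least two clusters so that a runner-up exists.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hinge(x):
    """{x}_+ = max(x, 0)."""
    return max(x, 0.0)


def transductive_loss(w_rows, codes, beta):
    """(beta / U) * sum_i {1 - [best score - runner-up score]}_+ over the unlabeled codes."""
    total = 0.0
    for h in codes:
        scores = sorted((dot(w, h) for w in w_rows), reverse=True)
        total += hinge(1.0 - (scores[0] - scores[1]))
    return beta * total / len(codes)


# Toy example: the first code is separated with margin 2, the second sits on the boundary.
w_rows = [[2.0, 0.0], [0.0, 2.0]]
codes = [[1.0, 0.0], [0.5, 0.5]]
loss = transductive_loss(w_rows, codes, beta=1.0)
```

In this toy case only the second code contributes (its margin is 0, so its hinge value is 1), which gives a loss of 0.5 after averaging over the two unlabeled codes.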

In the objective function, we need to estimate the clustering weight $W$ as well as the weights $\theta_l$ for
each layer $l \in [1, L]$ of the deep network. From the objective function, we can compute the gradients w.r.t. $W$
and $\theta_l$ for $l \in [1, L]$ (via backpropagation), and gradient-based methods can be used to optimize
it. Note that $h$ (we omit the subscript for convenience) in the objective function Eq. (8) is the non-linear
embedding code of $x$. Thus, the objective function is no longer convex, and we can only find a local
minimum. In practice, we find that L-BFGS does not work well and easily gets trapped in a bad local minimum. In our
work, we use (sub)gradient descent to optimize the objective function, alternating between projecting the
training data with $f$ and optimizing the objective function.


3.3 Parameter learning 


We learn the parameters in an alternating manner: (1) project the data, given the model parameters; (2)
optimize the model parameters with gradient descent. To compute the gradients of the parameters, we first need
to find the most violated constraints. For the same-label pairs, we have the following most violated set:


\[
A^+ = \Big\{ (h_{i1}, h_{i2}, \delta(z_{i1} = z_{i2})) \in C^+ \;\Big|\; \max_{z_{i1} = z_{i2}} W^T \Phi(h_{i1}, h_{i2}, z_{i1}, z_{i2}) - \max_{s_{i1} \neq s_{i2}} W^T \Phi(h_{i1}, h_{i2}, s_{i1}, s_{i2}) < 1 \Big\} \tag{9}
\]


For the different-label pairs, we denote the most violated set as 


\[
A^- = \Big\{ (h_{j1}, h_{j2}, \delta(z_{j1} = z_{j2})) \in C^- \;\Big|\; \max_{z_{j1} \neq z_{j2}} W^T \Phi(h_{j1}, h_{j2}, z_{j1}, z_{j2}) - \max_{s_{j1} = s_{j2}} W^T \Phi(h_{j1}, h_{j2}, s_{j1}, s_{j2}) < 1 \Big\} \tag{10}
\]
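
The violated sets in Eqs. (9) and (10) can be found by brute force over cluster assignments. The sketch below is illustrative only: it assumes per-cluster weight rows `w_rows`, constraints given as index triples `(i, j, same)`, and 0-indexed clusters, none of which are prescribed by the text.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pair_score(w_rows, hi, hj, zi, zj):
    """W^T Phi(h_i, h_j, z_i, z_j) = w_{z_i} . h_i + w_{z_j} . h_j."""
    return dot(w_rows[zi], hi) + dot(w_rows[zj], hj)


def best_scores(w_rows, hi, hj):
    """Best score with equal labels and best score with different labels."""
    K = len(w_rows)
    same = max(pair_score(w_rows, hi, hj, k, k) for k in range(K))
    diff = max(pair_score(w_rows, hi, hj, a, b)
               for a in range(K) for b in range(K) if a != b)
    return same, diff


def violated_sets(w_rows, codes, constraints):
    """Return (A_plus, A_minus): constraints whose margin is below 1."""
    a_plus, a_minus = [], []
    for i, j, same in constraints:
        s, d = best_scores(w_rows, codes[i], codes[j])
        if same and s - d < 1.0:
            a_plus.append((i, j))
        elif not same and d - s < 1.0:
            a_minus.append((i, j))
    return a_plus, a_minus


w_rows = [[3.0, 0.0], [0.0, 3.0]]
codes = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
constraints = [(0, 1, 1), (0, 2, 0), (0, 3, 1)]
a_plus, a_minus = violated_sets(w_rows, codes, constraints)
```

The enumeration over different-label assignments costs $O(K^2)$ per pair here; this is a simple choice for a sketch, not a claim about how the original implementation searches.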
Then, we compute the gradient w.r.t. $W$


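\begin{align}
dW = \lambda W
& + \frac{1}{n^+} \sum_{(h_{i1}, h_{i2}, 1) \in A^+} \Big[ \Phi(h_{i1}, h_{i2}, z^-_{i1}, z^-_{i2}) - \Phi(h_{i1}, h_{i2}, z^+_{i1}, z^+_{i2}) \Big] \notag \\
& - \frac{1}{n^-} \sum_{(h_{j1}, h_{j2}, 0) \in A^-} \Big[ \Phi(h_{j1}, h_{j2}, z^-_{j1}, z^-_{j2}) - \Phi(h_{j1}, h_{j2}, z^+_{j1}, z^+_{j2}) \Big] \notag \\
& - \frac{\beta}{U} \sum_{i \in U} \Big[ \Phi(h_i, z^+_i) - \Phi(h_i, s^+_i) \Big], \tag{11}
\end{align}

where $(z^+_{i1}, z^+_{i2}) = \arg\max_{z_{i1} = z_{i2}} W^T \Phi(h_{i1}, h_{i2}, z_{i1}, z_{i2})$ and
$(z^-_{i1}, z^-_{i2}) = \arg\max_{z_{i1} \neq z_{i2}} W^T \Phi(h_{i1}, h_{i2}, z_{i1}, z_{i2})$; and for
the unlabeled set $z^+_i = \arg\max_{z_i} W^T \Phi(h_i, z_i)$ and $s^+_i = \arg\max_{s_i \neq z^+_i} W^T \Phi(h_i, s_i)$.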
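
As a worked instance of the same-label term in Eq. (11), the sketch below forms the contribution of one violated must-link pair to $dW$ as the difference of two joint features. The helper names and the 0-indexed labels are assumptions; the labels `z_plus` and `z_minus` are taken as already computed.

```python
def pair_feature(hi, hj, zi, zj, num_clusters):
    """Phi(h_i, h_j, z_i, z_j) = Phi(h_i, z_i) + Phi(h_j, z_j) on the concatenated layout."""
    d = len(hi)
    phi = [0.0] * (num_clusters * d)
    for t in range(d):
        phi[zi * d + t] += hi[t]
        phi[zj * d + t] += hj[t]
    return phi


def must_link_dW(hi, hj, z_plus, z_minus, num_clusters, n_plus):
    """(1 / n+) * [Phi(., z-) - Phi(., z+)] for one violated same-label pair."""
    plus = pair_feature(hi, hj, z_plus[0], z_plus[1], num_clusters)
    minus = pair_feature(hi, hj, z_minus[0], z_minus[1], num_clusters)
    return [(m - p) / n_plus for m, p in zip(minus, plus)]


# Toy pair: best same-label assignment is (0, 0), best different-label one is (0, 1).
hi, hj = [1.0, 0.0], [0.9, 0.1]
grad = must_link_dW(hi, hj, z_plus=(0, 0), z_minus=(0, 1), num_clusters=2, n_plus=1)
```

A descent step along the negative of this vector raises the same-cluster score and lowers the competing different-cluster score, which is the intended effect of the hinge term.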

In order to learn discriminative features, we also need to estimate the weights of the multi-layer network.
Note that for each pair $(h_i, h_j)$ that violates the constraints in Eqs. (5) and (6), we can compute the
gradients w.r.t. $h_i$ and $h_j$ respectively, which are then used to calculate the gradients of $\theta_l$ for
$l \in [1, L]$ in the deep network. We use $H = [h_1, h_2, \ldots, h_N]$ to denote the concatenation of all the
hidden codes, where $H \in \mathbb{R}^{d \times N}$, with each column $H(:, i) = h_i$.


For the positive pairs, we have 


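\begin{align}
dh_{i1} &= -\frac{1}{n^+} \sum \big[ W_{z^+_{i1}} - W_{z^-_{i1}} \big] \tag{12a} \\
dh_{i2} &= -\frac{1}{n^+} \sum \big[ W_{z^+_{i2}} - W_{z^-_{i2}} \big] \tag{12b}
\end{align}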


where $W_{z^+_{i1}}$ indicates the weight vector corresponding to the cluster label $z^+_{i1}$ in the whole weight
matrix $W$. More specifically, $W_{z^+_{i1}} = W[(z^+_{i1} - 1) \times d + 1 : z^+_{i1} \times d]$.

For the negative pairs, we can get 


\begin{align}
dh_{j1} &= -\frac{1}{n^-} \sum \big[ W_{z^-_{j1}} - W_{z^+_{j1}} \big] \tag{13a} \\
dh_{j2} &= -\frac{1}{n^-} \sum \big[ W_{z^-_{j2}} - W_{z^+_{j2}} \big] \tag{13b}
\end{align}


For the unlabeled instances, we have 

\[
dh_i = -\frac{\beta}{U} \sum \big[ W_{z^+_i} - W_{s^+_i} \big] \tag{14}
\]

Given the gradient $dh_i$ for each hidden code, we can get the gradient w.r.t. $H$ as

\[
dH(:, i) = dh_i \tag{15}
\]

where $dh_i$ can be calculated according to Eqs. (12), (13) and (14). Then, we can calculate the gradients w.r.t.
the lower-level weights with back-propagation. For example,
$d\theta_L = dH \times \big( f_{1 \to L}(X) \bullet (1 - f_{1 \to L}(X)) \big)$, where $\times$
represents matrix multiplication, and $\bullet$ indicates the pointwise product.
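
The backpropagation step through one logistic layer can be sketched as below. This is a generic chain-rule illustration under the assumption that a layer computes $\mathrm{logistic}(\theta [x, 1])$ with one weight row per hidden unit; the function names are hypothetical and the layout may differ from the authors' implementation.

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def layer_forward(theta, x):
    x_ext = list(x) + [1.0]  # append 1 to handle the bias
    return [logistic(sum(t * v for t, v in zip(row, x_ext))) for row in theta]


def layer_backward(theta, x, h, dh):
    """Given dh (gradient w.r.t. the layer output h), return (dtheta, dx)."""
    x_ext = list(x) + [1.0]
    # Derivative of the logistic function: h * (1 - h), applied pointwise.
    delta = [g * o * (1.0 - o) for g, o in zip(dh, h)]
    dtheta = [[dl * v for v in x_ext] for dl in delta]
    dx = [sum(delta[u] * theta[u][v] for u in range(len(theta)))
          for v in range(len(x))]
    return dtheta, dx


theta = [[0.5, -0.3, 0.1]]
x = [1.0, 2.0]
h = layer_forward(theta, x)
dtheta, dx = layer_backward(theta, x, h, [1.0])
```

The returned `dx` plays the role of the gradient w.r.t. the output of the layer below, so calling `layer_backward` layer by layer from the top propagates $dH$ down to $\theta_1$.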

Initialization: We used stacked RBMs to initialize the weights of the deep network greedily, layer by layer,
with contrastive divergence [12] (we used CD-1 in our experiments). Note that we used Gaussian RBMs for
continuous data in the first layer, and binary RBMs otherwise.
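
For readers unfamiliar with CD-1, the following Python sketch shows one update of a binary RBM on a single training vector. It is a textbook-style illustration rather than the authors' code: the names are hypothetical, the reconstruction uses probabilities instead of samples (a common simplification), and the Gaussian first-layer variant is not covered.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def hidden_probs(W, b_hid, v):
    return [sigmoid(b + sum(w * x for w, x in zip(row, v)))
            for row, b in zip(W, b_hid)]


def visible_probs(W, b_vis, h):
    return [sigmoid(b_vis[i] + sum(W[j][i] * h[j] for j in range(len(W))))
            for i in range(len(b_vis))]


def cd1_step(W, b_vis, b_hid, v0, lr, rng):
    """One CD-1 update in place; W[j][i] connects hidden unit j and visible unit i."""
    ph0 = hidden_probs(W, b_hid, v0)                       # positive phase
    h0 = [1.0 if rng.random() < p else 0.0 for p in ph0]   # sample hidden states
    v1 = visible_probs(W, b_vis, h0)                       # one-step reconstruction
    ph1 = hidden_probs(W, b_hid, v1)                       # negative phase
    for j in range(len(W)):
        for i in range(len(v0)):
            W[j][i] += lr * (ph0[j] * v0[i] - ph1[j] * v1[i])
        b_hid[j] += lr * (ph0[j] - ph1[j])
    for i in range(len(v0)):
        b_vis[i] += lr * (v0[i] - v1[i])


W = [[0.0] * 3 for _ in range(2)]
b_vis, b_hid = [0.0] * 3, [0.0] * 2
cd1_step(W, b_vis, b_hid, [1.0, 0.0, 1.0], lr=0.1, rng=random.Random(0))
```

Stacking then amounts to training one RBM, mapping the data through `hidden_probs`, and training the next RBM on those outputs.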
Algorithm 1

 1: Input: the training data $X$, pairwise constraints $C$, the number of clusters $K$, the number of iterations
    $T$, $\lambda$, and $\beta$;
 2: Initialize $W$;
 3: Initialize $\theta_l$ for $l = \{1, \ldots, L\}$ layer-by-layer greedily;
 4: for $i = 1$; $i \leq T$; $i{+}{+}$ do
 5:    if the objective in Eq. (8) has no significant changes, break;
 6:    project all training data $X$ into the latent space via Eq. (3);
 7:    find the most violated constraints according to Eqs. (9) and (10);
 8:    compute the gradient w.r.t. $W$ via Eq. (11);
 9:    compute the gradient w.r.t. $H$ via Eq. (15);
10:    compute the gradient w.r.t. $\Theta = \{\theta_l\}_{l=1}^L$ with backpropagation;
11:    update the parameters with gradient descent via Eq. (16);
12: end for
13: Return model parameters $W$ and $\{\theta_l\}_{l=1}^L$, as well as the average accuracy;


In our deep model, the weights of layers 1 to $L$ are $\theta_l$, for $l = \{1, \ldots, L\}$, and the top
layer $L$ has weight $\theta_L$. We first pre-train the $L$-layer deep structure with RBMs greedily, layer by layer.
Thus, our deep network can learn a parametric non-linear mapping from the input $x$ to the output $h$, $f : x \to h$.
Specifically, we treat an RBM as a 1-layer deep network with weight $\theta_1$. For example, for a 1-layer DBN, we have
$h = f_1(x) = \mathrm{logistic}(\theta_1^T [x, 1])$, where we extend $x \in \mathbb{R}^D$ into
$[x, 1] \in \mathbb{R}^{D+1}$ in order to handle the bias in the
non-linear mapping. Taking the output of the current layer as the input to the next layer, we can learn the
weights of each layer greedily.

As for the clustering weight W, we take a similar strategy as in [31] to initialize it. 

Parameter updating: In our model, we use gradient descent to update the model parameters. We also
tried L-BFGS [?] to update the model parameters, but it did not perform well. The model parameters are
updated as follows,


\[
W \leftarrow W - \gamma_w \, dW, \qquad
\theta_l \leftarrow \theta_l - \gamma_{\theta_l} \, d\theta_l, \quad l \in \{1, \ldots, L\} \tag{16}
\]

where $\gamma_w$ is the learning rate for the clustering weight $W$, and $\gamma_{\theta_l}$ is the learning rate
for the weights $\theta_l$ in the deep neural network. Thus, our method alternates between data projection and
parameter optimization. For more details, refer to Algorithm 1.

After learning the model parameters, we can do cluster analysis according to Eq. (2).


4 Experiments 


In this section, we present a set of experiments comparing our method to state-of-the-art semi-supervised
clustering methods on a wide range of data sets, including the UCI data sets and the Reuters dataset in Table 1,
as well as the MNIST digits, COIL-20 and COIL-100 datasets. We also evaluate whether the transductive
constraint in Eq. (7) is helpful in the clustering analysis.


4.1 Experimental setup 

In the experiments, we compared our method to state-of-the-art semi-supervised clustering approaches,
including Xing [32], ITML [6], KISSME [?] and CMMC [?]. Note that Xing, ITML and KISSME are the
                           Description
Dataset               #vectors    dim      #clusters
Glass                 214         9        7
Wdbc                  569         32       2
Wine                  178         13       3
Sonar                 208         60       2
Image Segmentation    2310        19       7
Reuters               8293        18933    65

Table 1: Descriptions of the UCI datasets and the Reuters dataset.


Accuracy (%)
Methods       Glass    Wdbc     Wine     Sonar    Segmentation    Reuters
Xing [32]     46.2     91.9     81.5     53.4     28.5            44.3
ITML [6]      47.4     92.1     70.2     69.2     30.0            45.0
KISSME [?]    36.5     77.9     65.2     67.3     27.6            49.7
CMMC [?]      43.2     89.5     97.1     72.1     51.4            66.5
Our method    50.9     91.5     98.8     72.6     57.1            72.7

Adjusted Rand Index
Methods       Glass    Wdbc     Wine     Sonar    Segmentation    Reuters
Xing [32]     0.214    0.70     0.584    0.02     0.12            0.14
ITML [6]      0.223    0.71     0.520    0.14     0.14            0.15
KISSME [?]    0.07     0.287    0.466    0.12     0.09            0.17
CMMC [?]      0.217    0.620    0.918    0.191    0.35            0.22
Our method    0.219    0.689    0.965    0.20     0.41            0.56

Table 2: The experimental comparison on the UCI data sets and the Reuters data set. On the real UCI
datasets, our method outperforms the other methods significantly, except on the Glass and Wdbc data sets. Our
method is remarkably better on the Reuters dataset, in both accuracy and adjusted Rand index.


semi-supervised approaches for metric learning (Mahalanobis). Thus, we used those methods to learn the
metric and calculate the distances between all instances, and then used kernel k-means [?] for clustering.
Our method and CMMC are similar in that both can be directly optimized for clustering.

As for the parameter setting, we set $\lambda = 0.02$ and $\beta = 1$. The learning rate $\gamma_w$ decreases over
the iterations $i$ in our model, while the learning rate for the weights in the
deep network is fixed at $\gamma_{\theta_l} = 0.01$, for $l = \{1, \ldots, L\}$. Unless otherwise specified, our
model used one hidden layer with 100 units on most data sets, except on the MNIST and UCI data sets.

We tested our method on two tasks: pairwise classification and clustering analysis. For pairwise
classification, we randomly sampled 200 pairwise constraints (around 100 must-links and 100 cannot-links), of which we
used 100 (50 must-links and 50 cannot-links) as the training set and the remaining 100
as the testing set. We then used the receiver operating characteristic (ROC) to evaluate the performance.

For the clustering analysis, we used the sampled pairwise constraints to train our model, and then used
the learned model for clustering. We used the accuracy (under the best matching between the
obtained labels and the original true labels, refer to [34]) and the adjusted Rand index [?] to evaluate our
method in all the experiments.
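
The two evaluation measures can be sketched in a few lines of Python. This is an illustrative implementation, not the authors' evaluation code: the accuracy enumerates all label permutations, which is only practical for a small number of clusters (for larger $K$ an assignment solver such as the Hungarian algorithm would be the usual choice), and the adjusted Rand index follows the standard contingency-table formula.

```python
from collections import Counter
from itertools import permutations
from math import comb


def clustering_accuracy(true, pred):
    """Fraction of correct labels under the best one-to-one relabeling of pred."""
    labels = sorted(set(true) | set(pred))
    best = 0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        hits = sum(1 for t, p in zip(true, pred) if mapping[p] == t)
        best = max(best, hits)
    return best / len(true)


def adjusted_rand_index(true, pred):
    """Adjusted Rand index computed from pair counts of the contingency table."""
    n = len(true)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(true, pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(true).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


true = [0, 0, 1, 1]
swapped = [1, 1, 0, 0]     # same partition with the labels swapped
unrelated = [0, 1, 0, 1]   # partition that cuts across the true clusters
```

Both measures are invariant to how the cluster labels are named, which is what makes them suitable for comparing a clustering against ground-truth classes.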


4.2 Results 

UCI data sets: In the experiment, we selected five widely used data sets from the UCI machine learning
repository¹, which differ in dimensionality and number of categories, as shown in Table 1. We set the number of
hidden nodes in our model to 100 on the Sonar data set and 64 on the
other UCI data sets. For each data set, we randomly sampled 200 pairwise constraints, of which 100
pairs were used for training and the rest to test the pairwise classification performance. For clustering
performance, we tested the model on all the data elements. We compared our method to the state-of-the-art
methods, and the clustering results are shown in Table 2. It demonstrates that our method outperforms other


1 https://archive.ics.uci.edu/ml/datasets.html 












methods on almost all the data sets, especially for the data with a larger number of classes. We also show
the performance of our method on the pairwise classification task in Fig. 1. Except on the Wdbc and Glass
data sets, our method yields competitive or even better results than the other methods.

Reuters data set: We used Reuters21578², which has a total of 8293 documents with 18933-dimensional
features for each document, belonging to 65 categories. Because the Reuters data set is high-dimensional, we
first projected it into 400 dimensions with PCA. We then set the number of hidden nodes to 100 in our
model. The clustering performance is shown in Table 2. It demonstrates that our method is significantly
better than the other methods. Again, our method yields a remarkably better pairwise classification result, shown
in the bottom right of Fig. 1.




Figure 1: The pairwise classification results (ROC curves; x-axis: False Positive Rate (FPR)) on the five UCI
data sets and the Reuters data set (bottom right).


2 http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html 
Figure 2: The clustering performance comparison on the MNIST digits by varying the number of training
pairs (x-axis: the number of pairwise constraints), evaluated with accuracy and Rand index respectively. It
demonstrates that our method is significantly better than other methods for clustering analysis.


MNIST dataset: The MNIST digits³ data set consists of 28 x 28 images of handwritten digits from 0 through
9, with a training set of 60,000 examples and a test set of 10,000 examples, and has been widely used to test
character recognition methods. In the experiment, we randomly sampled a subset of 5000 images from the
training set to test our method and the other baselines. We used a three-layer deep structure
for the MNIST digits, with [400 200 100] hidden nodes on the respective layers. We tested how the clustering
performance changes as the number of pairwise constraints varies. The experimental comparisons between
our method and the other baselines are shown in Fig. 2. They demonstrate that the clustering accuracy increases
with more pairwise constraints, and that our method is better than the other baselines in most cases
when varying the number of training pairs.

To evaluate whether the transductive constraint in Eq. (4) of our model is helpful for clustering, we
set $\beta = 0$ to remove the transductive condition in Eq. (7); the experimental results are shown in Fig.
3. We argue that the result in Fig. 3 is consistent with common sense. The smaller the number of pairwise
constraints, the higher the uncertainty at inference. Thus, transductive learning has no advantage
when the number of constraints is small, but it performs better with more constraints in Fig. 3. When
more and more pairwise constraints are available, there is no need to incorporate transductive principles in
the model. To sum up, it demonstrates that the transductive constraint in our model is helpful
for semi-supervised clustering analysis.

COIL data set: We tested our method on both the COIL-20 and COIL-100 image data sets. The COIL-20 data
set⁴ has a total of 1440 images of size 128 x 128. It is divided into 20 classes of objects, with 72 images for
each object. In our experiments, we used the processed version, which contains images of all the objects
with the background discarded, and we further resized all the images to 32 x 32 to save space
and time. The COIL-100 data set consists of 7200 images, partitioned into 100 classes. Similarly,
we resized the images to 32 x 32 before learning the clustering model.

The clustering performance is shown in Fig. 5, and it demonstrates that our method is better than the other
methods in both stability and clustering accuracy. Where does the performance gain in Fig. 5 come from: deep
learning or transductive learning? To answer this question, we evaluated whether transductive training is
helpful when the number of pairwise constraints is limited. Figs. 4(a) and (b) show the results on
20 classes and demonstrate that transductive learning is helpful when the number of training pairs is in the
range [400, 2000]. But with more and more training constraints, transductive learning does not improve the
performance much. Figs. 4(c) and (d) show the results on COIL with 100 classes, and indicate

3 http://yann.lecun.com/exdb/mnist/

4 http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
Figure 3: The comparison between with and without transductive principles for our method. (a) and (b)
show the results (evaluated with accuracy and Rand index respectively) on the MNIST data set.


that transductive learning cannot improve the performance much when the number of classes is large. We
think the reason that transductive learning does not perform well in Figs. 4(c) and (d) is that it cannot infer
labels well with a large margin on a dataset with a larger number of clusters. We argue that with more
data and more clusters, it is harder to partition the data well, and more difficult to find a hyperplane
that separates one cluster from the others with a large margin. In other words, it is harder to satisfy the
condition in Eq. (7). Compared to the COIL dataset, transductive learning yields a larger gain on the MNIST
data set in Fig. 3. Thus, transductive learning is helpful when the number of classes is small and the data
is well distributed (compact within the same cluster, and separated between different clusters).

Thus, according to the above analysis, most of the performance gain on the COIL-100 data set in Fig. 5 comes
from deep learning.



Figure 4: The comparison between with and without transductive principles for our method. (a) and (b)
show the results (evaluated with accuracy and Rand index respectively) on the COIL-20 data set; (c) and (d)
are the results on the COIL-100 dataset, with accuracy and Rand index respectively. For the 20 classes, it
shows that transductive learning is helpful when the number of training pairs is small. However, for the
100 classes, transductive learning cannot improve the performance. It demonstrates that it is better to
leverage transductive principles when the number of classes is relatively small.


5 Conclusions 


In this paper, we propose a deep transductive semi-supervised maximum margin clustering approach. On
the one hand, we leverage deep learning to learn non-linear representations, which can be used as the input
to the semi-supervised clustering model. On the other hand, we incorporate the unlabeled instances into
Figure 5: The clustering comparison on the COIL-100 data set by varying the number of training pairs. It
shows that our method outperforms other methods significantly.


our semi-supervised clustering framework. Thus, our model unifies transductive learning, deep learning,
maximum margin and semi-supervised clustering in one framework. Compared to conventional methods, our
approach can learn non-linear mappings as well as leverage transductive information to improve clustering
performance. We pretrain the deep structure with stacked restricted Boltzmann machines greedily, layer by layer,
for feature representations and optimize our objective function with gradient descent. We demonstrate
the advantages of our model over the state of the art in the experiments.